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Abstract 

We consider statistical and algorithmic aspects of 
solving large-scale least-squares (LS) problems 
using randomized sketching algorithms. Prior re¬ 
sults show that, from an algorithmic perspective, 
when using sketching matrices constructed from 
random projections and leverage-score sampling, 
if the number of samples r much smaller than 
the original sample size n, then the worst-case 
(WC) error is the same as solving the original 
problem, up to a very small relative error. From 
a statistical perspective, one typically considers 
the mean-squared error performance of random¬ 
ized sketching algorithms, when data are gener¬ 
ated according to a statistical linear model. In 
this paper, we provide a rigorous comparison of 
both perspectives leading to insights on how they 
differ. To do this, we hrst develop a framework 
for assessing, in a unihed manner, algorithmic 
and statistical aspects of randomized sketching 
methods. We then consider the statistical predic¬ 
tion efficiency (PE) and the statistical residual ef- 
hciency (RE) of the sketched LS estimator; and 
we use our framework to provide upper bounds 
for several types of random projection and ran¬ 
dom sampling algorithms. Among other results, 
we show that the RE can be upper bounded when 
r is much smaller than n, while the PE typically 
requires the number of samples r to be substan¬ 
tially larger. Lower bounds developed in subse¬ 
quent work show that our upper bounds on PE 
can not be improved. 


1. Introduction 

Recent work in large-scale data analysis and Randomized 
Linear Algebra (REA) has focused on developing so-called 
sketching algorithms: given a data set and an objective 
function of interest, construct a small “sketch” of the full 
data set, e.g., by using random sampling or random projec¬ 
tion methods, and use that sketch as a surrogate to perform 
computations of interest for the full data set (see (Mahoney, 
2011) for a review). Most effort in this area has adopted an 
algorithmic perspective, whereby one shows that, when the 
sketches are constructed appropriately, one can obtain an¬ 
swers that are approximately as good as the exact answer 
for the input data at hand, in less time than would be re¬ 
quired to compute an exact answer for the data at hand. 
Erom a statistical perspective, however, one is often more 
interested in how well a procedure performs relative to an 
hypothesized model than how well it performs on the par¬ 
ticular data set at hand. Thus an important question to con¬ 
sider is whether the insights from the algorithmic perspec¬ 
tive of sketching carry over to the statistical setting. To 
address this, in this paper, we develop a unihed approach 
that considers both the statistical perspective and algorith¬ 
mic perspective on recently-developed randomized sketch¬ 
ing algorithms in REA, and we provide bounds on two sta¬ 
tistical objectives for several types of random projection 
and random sampling sketching algorithms. 

1.1. Overview of the problem 

The problem we consider in this paper is the ordinary least- 
squares (LS or OLS) problem: given as input a matrix X S 
of observed features or covariates and a vector Y € 
R" of observed responses, return as output a vector /3 ols 
that solves the following optimization problem: 


Pols = arg min ||y - Xp\\l. 

ptMP 
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We will assume that n and p are both very large, with n ^ 
p, and for simplicity we will assume rank(X) = p, e.g., to 
ensure a unique full-dimensional solution. The LS solution, 
Pols = has a number of well-known de¬ 

sirable statistical properties (Chatterjee & Hadi, 1988); and 
it is also well-known that the running time or computational 
complexity for this problem is 0{np^) (Golub & Loan, 
1996).' For many modern applications, however, n may 
be on the order of 10® — 10® and p may be on the order of 
10® — 10^, and thus computing the exact LS solution with 
traditional 0{np^) methods can be computationally chal¬ 
lenging. This, coupled with the observation that approx¬ 
imate answers often suffice for downstream applications, 
has led to a large body of work on developing fast approx¬ 
imation algorithms to the LS problem (Mahoney, 2011). 

One very popular approach to reducing computation is to 
perform LS on a carefully-constructed “sketch” of the full 
data set. That is, rather than computing a LS estimator from 
Problem (1) from the full data (X, Y), generate “sketched 
data” {SX,SY) where S G with r ^ n, is a 

“sketching matrix,” and then compute a LS estimator from 
the following sketched problem: 

Ps = arg min ||S'y - SXP\\l. (2) 

Once the sketching operation has been performed, the addi¬ 
tional computational complexity of Ps is 0{rp^), i.e., sim¬ 
ply call a traditional LS solver on the sketched problem. 
Thus, when using a sketching algorithm, two criteria are 
important: first, ensure the accuracy of the sketched LS es¬ 
timator is comparable to, e.g., not much worse than, the 
performance of the original LS estimator; and second, en¬ 
sure that computing and applying the sketching matrix S 
is not too computationally intensive, e.g., that it is much 
faster than solving the original problem exactly. 

1.2. Prior results 

Random sampling and random projections provide two 
approaches to construct sketching matrices S that satisfy 
both of these criteria and that have received attention re¬ 
cently in the computer science community. In terms of 
running time guarantees, the running time bottleneck for 
random projection algorithms for the LS problem is the 
application of the projection to the input data, i.e., ac¬ 
tually performing the matrix-matrix multiplication to im¬ 
plement the projection and compute the sketch. By us¬ 
ing fast Hadamard-based random projections, however, 

'That is, O(np^) time suffices to compute the LS solu¬ 
tion from Problem (1) for arbitrary or worst-case input, with, 
e.g., the Cholesky Decomposition on the normal equations, with 
a QR decomposition, or with the Singular Value Decomposi¬ 
tion (Golub & Loan, 1996). 


Drineas et al. (Drineas et ah, 2011) developed a random 
projection algorithm that runs on arbitrary or worst-case 
input in o{np^) time. (See (Drineas et al., 2011) for a 
precise statement of the running time.) As for random 
sampling, Drineas et al. (Drineas et al., 2006; 2012) have 
shown that if the random sampling is performed with re¬ 
spect to nonuniform importance sampling probabilities that 
depend on the empirical statistical leverage scores of the 
input matrix X, i.e., the diagonal entries of the hat matrix 
H = then one obtains a random sam¬ 

pling algorithm that achieves much better results for arbi¬ 
trary or worst-case input. 

Leverage scores have a long history in robust statistics 
and experimental design. In the robust statistics com¬ 
munity, samples with high leverage scores are typically 
flagged as potential outliers (see, e.g., (Chatterjee & Hadi, 
2006; 1988; Hample et al., 1986; Hoaglin & Welsch, 1978; 
Huber & Ronchetti, 1981)). In the experimental design 
community, samples with high leverage have been shown 
to improve overall efficiency, provided that the underly¬ 
ing statistical model is accurate (see, e.g., (Royall, 1970; 
Zavlavsky et al., 2008)). This should be contrasted with 
their use in theoretical computer science. From the al¬ 
gorithmic perspective of worst-case analysis, that was 
adopted by Drineas et al. (Drineas et al., 2011) and Drineas 
et al. (Drineas et al., 2012), samples with high leverage 
tend to contain the most important information for subsam¬ 
pling/sketching. Thus it is beneficial for worst-case analy¬ 
sis to bias the random sample to include samples with large 
leverage scores or to rotate with a random projection to a 
random basis where the leverage scores are approximately 
uniformized. 

The running-time bottleneck for this leverage-based ran¬ 
dom sampling algorithm is the computation of the lever¬ 
age scores of the input data; and the obvious well-known 
algorithm for this involves 0{np^) time to perform a 
QR decomposition to compute an orthogonal basis for 
X (Golub & Loan, 1996). By using fast Hadamard-based 
random projections, however, Drineas et al. (Drineas et al., 
2012) showed that one can compute approximate QR 
decompositions and thus approximate leverage scores in 
o{np^) time; and (based on previous work (Drineas et al., 
2006)) this immediately implies a leverage-based random 
sampling algorithm that runs on arbitrary or worst-case in¬ 
put in o{np^) time (Drineas et al., 2012). Readers inter¬ 
ested in the practical performance of these randomized al¬ 
gorithms should consult Bendenpik (Avronet al., 2010) 
or LSRN (Meng et al., 2014). 

In terms of accuracy guarantees, both Drineas et 
al. (Drineas et ah, 2011) and Drineas et al. (Drineas et al., 
2012) prove that their respective random projection and 
leverage-based random sampling LS sketching algorithms 
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each achieve the following worst-case (WC) error guaran¬ 
tee: for any arbitrary (X, Y), 

||i" - xPsWl < (1 + - XPoLsWl, (3) 

with high probability for some pre-specified error param¬ 
eter K € (0,1). This 1 + K relative-error guarantee^ is 
extremely strong, and it is applicable to arbitrary or worst- 
case input. That is, whereas in statistics one typically as¬ 
sumes a model, e.g., a standard linear model on Y, 

Y = XP + e, (4) 

where [3 G Rf is the true parameter and e G K" is a 
standardized noise vector, with E[e] = 0 and E[ee^] = 
Inxn, in Drineas et al. (Drineas et al., 2011) and Drineas 
et al. (Drineas et al., 2012) no statistical model is assumed 
on X and Y, and thus the running time and quality-of- 
approximation bounds apply to any arbitrary {X, Y) input 
data. 

1.3. Our approach and main results 

In this paper, we address the following fundamental ques¬ 
tions. First, under a standard linear model, e.g., as given in 
Eqn. (4), what properties of a sketching matrix S are suffi¬ 
cient to ensure low statistical error, e.g., mean-squared er¬ 
ror? Second, how do existing random projection algorithms 
and leverage-based random sampling algorithms perform 
by this statistical measure? Third, how does this relate to 
the properties of a sketching matrix S that are sufficient to 
ensure low worst-case error, e.g., of the form of Eqn. (3), as 
has been established previously (Drineas et al., 2011; 2012; 
Mahoney, 2011)? We address these related questions in a 
number of steps. 

In Section 2, we will present a framework for evaluating the 
algorithmic and statistical properties of randomized sketch¬ 
ing methods in a unified manner; and we will show that 
providing WC error bounds of the form of Eqn. (3) and 
providing bounds on two related statistical objectives boil 
down to controlling different structural properties of how 
the sketching matrix S interacts with the left singular sub¬ 
space of the design matrix. In particular, we will consider 
the oblique projection matrix, Ilg = U{SUyS, where (-j^ 
denotes the Moore-Penrose pseudo-inverse of a matrix and 
U is the left singular matrix of X. This framework will 
allow us to draw a comparison between the WC error and 
two related statistical efficiency criteria, the statistical pre¬ 
diction efficiency (PE) (which is based on the prediction 
error E[||2f(/3 — /3)||2] and which is given in Eqn. (7) be¬ 
low) and the statistical residual efficiency (RE) (which is 
based on residual error E[||F — 2f/?||2] and which is given 

^The nonstandard parameter k is used here for the error pa¬ 
rameter since e is used below to refer to the noise or error process. 


in Eqn. (8) below); and it will allow us to provide sufficient 
conditions that any sketching matrix S must satisfy in order 
to achieve performance guarantees for these two statistical 
objectives. 

In Section 3, we will present our main theoretical results, 
which consist of bounds for these two statistical quan¬ 
tities for variants of random sampling and random pro¬ 
jection sketching algorithms. In particular, we provide 
upper bounds on the PE and RE (as well as the worst- 
case WC) for four sketching schemes: (1) an approxi¬ 
mate leverage-based random sampling algorithm, as is an¬ 
alyzed by Drineas et al. (Drineas et al., 2012); (2) a vari¬ 
ant of leverage-based random sampling, where the random 
samples are not re-scaled prior to their inclusion in the 
sketch, as is considered by Ma et al. (Ma et al., 2015); (3) a 
vanilla random projection algorithm, where S' is a random 
matrix containing i.i.d. Gaussian or Rademacher random 
variables, as is popular in statistics and scientific comput¬ 
ing; and (4) a random projection algorithm, where S is a 
random Hadamard-based random projection, as analyzed 
in (Boutsidis & Gittens, 2013). Eor sketching schemes (1), 
(3), and (4), our upper bounds for each of the two measures 
of statistical efficiency are identical up to constants; and 
they show that the RE scales as 1 -f while the PE scales 
as In particular, this means that it is possible to obtain 
good bounds for the RE when p < r <C n (in a manner 
similar to the sampling complexity of the WC bounds); but 
in order to obtain even near-constant bounds for PE, r must 
be at least of constant order compared to n. Eor the sketch¬ 
ing scheme (2), we show, on the other hand, that under the 
(strong) assumption that there are k “large” leverage scores 
and the remaining n — k are “small,” then the WC scales 
as 1 -f the RE scales as 1 -f and the PE scales as 
That is, sharper bounds are possible for leverage-score 
sampling without re-scaling in the statistical setting, but 
much stronger assumptions are needed on the input data. 
We also present a lower bound developed in subsequent 
work by Pilanci and Waniwright (Pilanci & Wainwright, 
2014) which shows that under general conditions on S, the 
upper bound of ^ for PE can not be improved. Hence our 
upper bounds in Section 3 on PE can not be improved. 

In Section 4, we will provide a brief discussion and con¬ 
clusion. Eor space reasons, we do not include in this 
conference version the proofs of our main results or our 
empirical results that support our theoretical findings; but 
they are included in the technical report version of this pa¬ 
per (Raskutti &. Mahoney, 2014). 

1.4. Additional related work 

Very recently, Ma et al. (Ma et al., 2015) considered statis¬ 
tical aspects of leverage-based sampling algorithms (called 
algorithmic leveraging in (Ma et al., 2015)). Assuming 
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a standard linear model on V of the form of Eqn. (4), 
the authors developed first-order Taylor approximations to 
the statistical RE of different estimators computed with 
leverage-based sampling algorithms, and they verified the 
quality of those approximations with computations on real 
and synthetic data. Taken as a whole, their results sug¬ 
gest that, if one is interested in the statistical performance 
of these randomized sketching algorithms, then there are 
nontrivial trade-offs that are not taken into account by stan¬ 
dard WC analysis. Their approach, however, does not im¬ 
mediately apply to random projections or other more gen¬ 
eral sketching matrices. Eurther, the realm of applicabil¬ 
ity of the first-order Taylor approximation was not pre¬ 
cisely quantified, and they left open the question of struc¬ 
tural characterizations of random sketching matrices that 
were sufficient to ensure good statistical properties on the 
sketched data. We address these issues in this paper. 


dom sampling or random projection operations, for now 
we let S be any r x n matrix. Then, we are interested in 
analyzing the performance of objectives characterizing the 
quality of a “sketched” LS objective, as given in Eqn (2), 
where again we are interested in solutions of the form 

Ps = {SX)^SY. (6) 

(We emphasize that this does not in general equal 
{(SX)"^SX)~^{SX)"’"SY, since the inverse will not exist 
if the sketching process does not preserve rank.) Our goal 
here is to compare the performance of Ps to Pols- We 
will do so by considering three related performance crite¬ 
ria, two of a statistical flavor, and one of a more algorithmic 
or worst-case flavor. 

Erom a statistical perspective, it is common to assume a 
standard linear model on Y, 


Subsequent work by Pilanci and Wain- 
wright (Pilanci & Wainwright, 2014) also considers a 
statistical perspective of sketching. Amongst other results, 
they develop a lower bound which confirms that using 
a single randomized sketching matrix S can not achieve 
a better PE than This lower bound complements the 
upper bounds developed in this paper. Their main focus is 
to use this insight to develop an iterative sketching scheme 
which yields bounds on the SPE when an r x n sketch is 
applied repeatedly. 

2. General framework and structural results 


Y = Xp + e, 

where we remind the reader that /3 € is the true parame¬ 
ter and e S K." is a standardized noise vector, with E[e] = 0 
and E[ee^] = Inxn- Erom this statistical perspective, we 
will consider the following two criteria. 


• The first statistical criterion we consider is the predic¬ 
tion efficiency (PE), defined as follows: 


Cpe{S) 


E[||X(^-fe)||i] 

E[||X(/?-/3oL5)|li]’ 


(7) 


In this section, we develop a framework that allows us 
to view the algorithmic and statistical perspectives on LS 
problems from a common perspective. We then use this 
framework to show that existing worst-case bounds as well 
as our novel statistical bounds for the mean-squared er¬ 
rors can be expressed in terms of different structural condi¬ 
tions on how the sketching matrix S interacts with the data 

(x,y). 


where the expectation E[-] is taken over the random 
noise e. 


• The second statistical criterion we consider is the 
residual efficiency (RE), defined as follows: 


E[||y-xfe||i] 

^ n\Y-xpoLs\\iy 


( 8 ) 


2.1. A statistical-algorithmic framework 

Recall that we are given as input a data set, {X,Y) S 
]^nxp ^ j-jjg objective function of interest is the 

standard LS objective, as given in Eqn. (1). Since we are 
assuming, without loss of generality, that rank(A) = p, we 
have that 

Pols = X^Y = {X^xy^X^Y, (5) 

where ( denotes the Moore-Penrose pseudo-inverse of a 
matrix. 

To present our framework and objectives, let S G 
denote an arbitrary sketching matrix. That is, although we 
will be most interested in sketches constructed from ran¬ 


where, again, the expectation E[-] is taken over the 
random noise e. 

Recall that the standard relative statistical efficiency for two 
estimators Pi and P 2 is defined as eff(/3i,/32) = ^ 

where Var( ) denotes the variance of the estimator (see, 
e.g., (Lehmann, 1998)). Eor the PE, we have replaced the 
variance of each estimator by the mean-squared prediction 
error. Eor the RE, we use the term residual since for any 
estimator /?, F — 2f/3 are the residuals for estimating Y. 

Erom an algorithmic perspective, there is no noise process 
e. Instead, X and Y are arbitrary, and P is simply computed 
from Eqn (5). To draw a parallel with the usual statistical 
generative process, however, and to understand better the 
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relationship between various objectives, consider “defin¬ 
ing” Y in terms of X by the following “linear model”; 


Y = X/3 + e, 

where /3 G and e G M". Importantly, /3 and e here repre¬ 
sent different quantities than in the usual statistical setting. 
Rather than e representing a noise process and /3 represent¬ 
ing a “true parameter” that is observed through a noisy Y, 
here in the algorithmic setting, we will take advantage of 
the rank-nullity theorem in linear algebra to relate X and 
Y? To define a “worst case model” Y = Xp + e for 
the algorithmic setting, one can view the “noise” process 
e to consist of any vector that lies in the null-space of 
Then, since the choice of /3 G is arbitrary, one can con¬ 
struct any arbitrary or worst-case input data Y. From this 
algorithmic case, we will consider the following criterion. 


• The algorithmic criterion we consider is the worst- 
case (WC) error, defined as follows; 


Cwc{S) = sup 
y 


\\Y-XPs\\l 

\\Y-XpoLs\\l' 


(9) 


This criterion is worst-case since we take a supremum 
Y, and it is the performance criterion that is analyzed 
in Drineas et al. (Drineas et ah, 2011) and Drineas et 
al. (Drineas et al., 2012), as bounded in Eqn. (3). 

Writing Y as X/3-|-e, where X^e = 0, the worst-case error 
can be re-expressed as; 


Cwc{S) 


\\Y-XPs\\l 

Y=xp+^xre=0 11 ^ “ XfioLsWi 


Hence, in the worst-case algorithmic setup, we take a 
supremum over e, where X^e = 0, whereas in the statisti¬ 
cal setup, we take an expectation over e where E[e] = 0. 

Before proceeding, several other comments about this 
algorithmic-statistical framework and our objectives are 
worth mentioning. 


• C}iE{S) is a statistical analogue of the worst-case al¬ 
gorithmic objective Cwc{S), since both consider the 

ratio of the metrics nV' ■ The difference is 

II L yOLS II 2 

that a sup over Y in the algorithmic setting is replaced 
by an expectation over noise e in the statistical set¬ 
ting. A natural question is whether there is an algorith¬ 
mic analogue of Cpe{S). Such a performance metric 
would be; 


wxjp-psm 

7 \\X{l3-f3oLs)\\r 


( 10 ) 


where /3 is the projection of Y on to the range space 
of X^. However, since Pols = P + (X^X)~^X^e 
and since X^e = 0, Pols = P tn the algorithmic 
setting, the denominator of Eqn. (10) equals zero, and 
thus the objective in Eqn. ( 10) is not well-defined. The 
“difficulty” of computing or approximating this objec¬ 
tive parallels our results below that show that approx¬ 
imating Cpe{S) is much more challenging (in terms 
of the number of samples needed) than approximating 
Cre{S). 


• In the algorithmic setting, the sketching matrix S and 
the objective C\yc{S) can depend on X and Y in any 
arbitrary way, but in the following we consider only 
sketching matrices that are either independent of both 
X and Y or depend only on X (e.g., via the statisti¬ 
cal leverage scores of X). In the statistical setting, S 
is allowed to depend on X, but not on Y, as any de¬ 
pendence of 5 on F might introduce correlation be¬ 
tween the sketching matrix and the noise variable e. 
Removing this restriction is of interest, especially in 
light of the recent results that show that one can ob¬ 
tain WC bounds of the form Eqn. (3) by constructing 
S by randomly sampling according to an importance 
sampling distribution that depends on the influence 
scores —essentially the leverage scores of the matrix 
X augmented with —Y as an additional column—of 
the (X, Y) pair. 


• Erom the perspective of our two linear models, we 
have that Pols = P + (X^X)“^X^e. In the sta¬ 
tistical setting, since E[ee^] = /„xn^ it follows that 
Pols is a random variable with ¥\Pols] = P and 
E[(/3 - PoLs){P - PoLsV] = (X^X)-I. In the al¬ 
gorithmic setting, on the other hand, since X^e = 0, 
it follows that Po ls = P- 

^The rank-nullity theorem asserts that given any matrix X G 
and vector Y G R", there exists a unique decomposition 
Y = XP-\-t, where P is the projection of Y on to the range space 
of X^ and e = Y — XP lies in the null-space of X^ (Meyer, 
2000 ). 


• Both Cpe{S) and Cpe{S) are qualitatively re¬ 
lated to quantities analyzed by Ma et al. (Ma et al., 
2015). In addition, Cwc{SI) is qualitatively similar to 
Cov(/3|F) in Ma et al., since in the algorithmic set¬ 
ting Y is treated as fixed; and Cpe{S) is qualitatively 
similar to Cov(/3) in Ma et al., since in the statistical 
setting Y is treated as random and coming from a lin¬ 
ear model. That being said, the metrics and results 
we present in this paper are not directly comparable 
to those of Ma et al. since, e.g., they had a slightly 
different setup than we have here, and since they used 
a first-order Taylor approximation while we do not. 
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2.2. Structural results on sketching matrices 

We are now ready to develop structural conditions char¬ 
acterizing how the sketching matrix S interacts with the 
data X ; this will allow us to provide upper bounds for the 
quantities Cwc{S), Cpe{S), and Cre{S). To do this, re¬ 
call that given the data matrix X, we can express the sin¬ 
gular value decomposition of X as X = , where 

U € is an orthogonal matrix, i.e., U'^U = /pxp- In 

addition, we can define the oblique projection matrix 

11% ■=U{SU)'<S. (11) 

Note that if rank(S'X) = p, then 11^ can be expressed 
as = U{U'^S'^SU)-^U'^S'^S, since U'^S'^SU is in¬ 
vertible. Importantly however, depending on the properties 
of X and how S is constructed, it can easily happen that 
rank(S'X) < p, even if rank(X) = p. 

Given this setup, we can now state the following lemma, 
which characterizes how Cwc{S), Cpe{S), and Cpe{S) 
depend on different structural properties of Ilg and SU. 

Lemma 1. For the algorithmic setting. 


Cwg{S) = 1 - 1 - 


sup 

(5eKP,C/Te=0 


II (^pxp 


(5f/)t(5f/))<5||i 

Iklli 



For the statistical setting, 

^ \\{ip^p-{su)^su)^v^p\\l ^ lin^lll ^ 

p p 

and 

n/p — 1 

Several points are worth making about Lemma 1 . 

• For all 3 criteria, the term which involves {SU)^SU 
is a “bias” term that is non-zero in the case that 
rank(5'f/) < p. For CpE^S) and Cre{S), the 
term corresponds exactly to the statistical bias; and 
if rank(5'?7) = p, meaning that S' is a rank-preserving 
sketching matrix, then the bias term equals 0, since 
{SUy SU = Ipxp- In practice, if r is chosen smaller 
than p or larger than but very close to p, it may happen 
that rank(SC/) < p, in which case this bias is incurred. 

• The final equality Cre{S) = 1-1- shows 

that in general it is much more difficult (in terms 
of the number of samples needed) to obtain bounds 
on Cpe{S) than Cre{S) —since Cre{S) re-scales 
CpEiS) hy pIn, which is much less than 1. This will 
be reflected in the main results below, where the scal¬ 
ing of Cre{S) will be a factor of p/n smaller than 


Cpe{S). In general, it is significantly more diffi¬ 
cult to bound CpEiS), since ||X(/3 — /Jols)!!! I® 
whereas \\Y — X[3ols\\2 so there is much 

less margin for error in approximating Cpe{S). 

• In the algorithmic or worst-case setting, 
sup,gR„/{o},nf^e=o is the relevant quan¬ 

tity, whereas in the statistical setting ||ng|||, is the 
relevant quantity. The Frobenius norm enters in the 
statistical setting because we are taking an average 
over homoscedastic noise, and so the £2 norm of the 
eigenvalues of 11^ need to be controlled. On the other 
hand, in the algorithmic or worst-case setting, the 
worst direction in the null-space of C/^ needs to be 
controlled, and thus the spectral norm enters. 

3. Main theoretical results 

In this section, we provide upper bounds for Cwc{S), 
Cpe{S), and Cre{S), where S correspond to random 
sampling and random projection matrices. In particu¬ 
lar, we provide upper bounds for 4 sketching matrices: 
(1) a vanilla leverage-based random sampling algorithm 
from Drineas et al. (Drineas et ah, 2012); (2) a variant of 
leverage-based random sampling, where the random sam¬ 
ples are not re-scaled prior to their inclusion in the sketch; 
(3) a vanilla random projection algorithm, where S' is a 
random matrix containing i.i.d. sub-Gaussian random vari¬ 
ables; and (4) a random projection algorithm, where S is 
a random Hadamard-based random projection, as analyzed 
in (Boutsidis & Gittens, 2013). 

3.1. Random sampling methods 

Here, we consider random sampling algorithms. To do so, 
first define a random sampling matrix S G M" as follows: 
Sij € {0,1} for all (z,j) and = 1’ where each 

row has an independent multinomial distribution with prob¬ 
abilities (pi)?=i- The matrix of cross-leverage scores is de¬ 
fined as L = UU'^ € and £i = La denotes the 

leverage score corresponding to the sample. Note that 
the leverage scores satisfy “ trace(L) = p and 

0<£i<l. 

The sampling probability distribution we consider {pi)2=i 
is of the form Pi = (1—0)^ +0qi, where satisfies 

0 < < 1 and Y%i=i = 1 is an arbitrary probability 

distribution, and 0 < 9 < 1. In other words, pi is a convex 
combination of a leverage-based distribution and another 
arbitrary distribution. Note that for 6 = 0, the probabilities 
are proportional to the leverage scores, whereas for 9=1, 
the probabilities follow the distribution defined by {qi}”=i. 

We consider two sampling matrices, one where the ran¬ 
dom sampling matrix is re-scaled, as in Drineas et 
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al. (Drineas et al., 2011), and one in which no re-scaling 
takes place. In particular, let Snr = S denote the random 
sampling matrix (where the subscript NR denotes the fact 
that no re-scaling takes place). The re-scaled sampling ma¬ 
trix is Sr e = SW, where W G is a diagonal 
re-scaling matrix, where [W]jj = and Wji = 0 for 

j ^ i. The quantity ^ is the re-scaling factor. 

In this case, we have the following result. 

Theorem 1. For S = Sr, with r > log ( 
then with probability at least 0.7, it holds that 
rank{S rU) = p and that: 

Cwc{Sr) < 1 - 1 - 12 - 

r 

71 

Cpe{Sr) < 44— 

r 

Cre{Sr) < 1-I-44-. 

r 

Several things are worth noting about this result. First, note 
that both Cwc{Sr) — 1 and Cre{Sr) — 1 scale as thus, 
it is possible to obtain high-quality performance guarantees 
for ordinary least squares, as long as ^ ^ 0, e.g., if r is 

only slightly larger than p. On the other hand, Cpe{Sr) 
scales as meaning r needs to be close to n to provide 
similar performance guarantees. Next, note that all of the 
upper bounds apply to any data matrix X, without assum¬ 
ing any additional structure on X. 

Also note that the distribution does not enter the 

results which means our bounds hold for any choice of 
™d don’t depend on 6. This allows to consider 
different distributions. A standard choice is uniform, i.e., 
gj = i (see e.g. Ma et al. (Maetal., 2015)). The other 
important example is that of approximate leverage-score 
sampling developed in (Drineas et al., 2012) that reduces 
computation. Let denote the approximate leverage 

scores developed by the procedure in (Drineas et al., 2012). 
Based on Theorem 2 in (Drineas et al., 2012), — < 0 

where 0 < 0 < 1 for r sufficiently large. Now, using 
Pi = Pi can be re-expressed as = (1 — 0)^ -|- 9qi 

where {qi)2=i is a distribution (unknown since we only 
have a bound on the approximate leverage scores). Hence, 
the performance bounds achieved by approximate leverag¬ 
ing are equivalent to those achieved by adding 0 multiplied 
by a uniform or other arbitrary distribution. 

Next, we consider the leverage-score estimator without re¬ 
scaling Snr- In order to develop nontrivial bounds on 
Cwc{Snr), Cpe{Snr), and Cre{Snr), we need to 
make a (strong) assumption on the leverage-score distri¬ 
bution on X. To do so, we define the following. 

Definition 1 (k-heavy hitter leverage distribution). A se¬ 
quence of leverage scores is a k-heavy hitter lever¬ 

age score distribution if there exist constants c, C > 0 such 


that for l<i<k, ^<£i<^ and for the remaining 
n — k leverage scores, Y^^=k+i Si \- 

The interpretation of a fc-heavy hitter leverage distribution 
is one in which only k samples in X contain the major¬ 
ity of the leverage score mass. The parameter k acts as a 
measure of non-uniformity, in that the smaller the k, the 
more non-uniform are the leverage scores. The fc-heavy 
hitter leverage distribution allows us to model highly non- 
uniform leverage scores which allows us to state the fol¬ 
lowing result. 

Theorem 2. For S = Snr, with 0 = 0 and as¬ 
suming a k-heavy hitter leverage distribution and r > 
ciplog(c 2 p), then with probability at least 0.6, it holds 
that rank(SNR) = P and that: 


Cwci^NR) 

< 

AAC^p 

^ 2 

r 

CpEi^NR) 

< 

AAC^ k 


CREiSNR) 

< 

, AAC'^ pk 

1+2 

nr 

Notice that when k <C 

n, 

bounds in Theorem 2 on 


Cpe{Snr) and Cre{Snr) are significantly sharper than 
bounds in Theorem 1 on Cpe{Sr) and Cre{Sr). Hence 
not re-scaling has the potential to provide sharper bound in 
the statistical setting. However a much stronger assumption 
on X is needed for this result. 

3.2. Random projection methods 

Here, we consider two random projection algorithms, one 
based on a sub-Gaussian projection matrix and the other 
based on a Hadamard projection matrix. To do so, de¬ 
fine [S'sGp]^ = where {X,j)i<i<r,i<j<n arei.i.d. 

sub-Gaussian random variables with £[27^] = 0, variance 
= cr^ and sub-Gaussian parameter 1. In this case, 
we have the following result. 

Theorem 3. For any matrix X, there exists a constant c 
such that if r > c' log n, then with probability greater than 
0.7, it holds that rank{SsGp) = P <^nd that: 

Cwc{SsGp) < 1 -I- 11- 

r 

77 

Cpe{Ssgp) < 44(1-1—) 

r 

CREiSsGp) S 1-I-44-. 

r 

Notice that the bounds in Theorem 3 for Srgp are equiv¬ 
alent to the bounds in Theorem 1 for Sr, except that r 
is required only to be larger than O(logn) rather than 
O(plogp). Hence for smaller values of p, random sub- 
Gaussian projections are more stable than leverage-score 
sampling based approaches. This reflects the fact that to 
a first-order approximation, leverage-score sampling per¬ 
forms as well as performing a smooth projection. 
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Next, we consider the randomized Hadamard projec¬ 
tion matrix. In particular, SHad = SunifHD, where 
H € is the standard Hadamard matrix (see 

e.g. (Hedayat & Wallis, 1978)), Sunif G is an r x n 
uniform sampling matrix, and D G is a diagonal 

matrix with random equiprobable ±1 entries. 

Theorem 4. For any matrix X, there exists a constant c 
such that if r > cp log n(logp -f log log n), then with prob¬ 
ability greater than 0.8, it holds that rank^Snad) = P ond 
that: 


CwciSnad) 

< 

1 + 40 \og(np)- 
r 

CREiSpad) 

< 

77 

401og(np)(l -1—) 
r 

CpEiSHad) 

< 

1 -f 401og(np)(l -f -) 


Notice that the bounds in Theorem 4 for Suad are equiv¬ 
alent to the bounds in Theorem 1 for Sr, up to a con¬ 
stant and log(np) factor. As discussed in Drineas et 
al. (Drineas et al., 2011), the Hadamard transformation 
makes the leverage scores of X approximately uniform (up 
to log(np) factor), which is why the performance is similar 
to the sub-Gaussian projection (which also tends to make 
the leverage scores of X approximately uniform). We sus¬ 
pect that the additional log(np) factor is an artifact of the 
analysis since we use an entry-wise concentration bound; 
using more sophisticated techniques, we believe that the 
log(np) can be improved. 

3.3. Lower Bounds 

In concurrent work, Pilanci and Wain- 
wright (Pilanci & Wainwright, 2014) amongst other 
results develop lower bounds on the numerator in Cpe{S) 
which prove that our upper bounds on Cpe{S) can not be 
improved. We re-state Theorem 1 (Example 1) in Pilanci 
and Wainwright (Pilanci & Wainwright, 2014) in a way 
that makes it most comparable to our results. 

Theorem 5 (Theorem 1 in (Pilanci & Wainwright, 
2014)). For any sketching matrix satisfying 
Hop < any estimator based on 

{SX, SY) satisfies the lower bound with probability 
greater than 1 /2: 


Cpe{S) > 


n 

128pr 


Gaussian and Hadamard projections as well as re-weighted 
approximate leverage-score sampling all satisfy the condi¬ 
tion ||E[S'^(S'S'^)“^S']||op < On the other hand un¬ 
weighted leverage-score sampling does not satisfy this con¬ 
dition and hence does not satisfy the lower bound which is 
why we are able to prove a tighter upper bound when the 
matrix X has highly non-uniform leverage scores. This 


proves that Cpe{S) is a quantity that is more challeng¬ 
ing to control than Cre{S) and Cwc{S) when only a sin¬ 
gle sketch is used. Using this insight, Pilanci and Wain¬ 
wright (Pilanci & Wainwright, 2014) show that by using a 
particular iterative Hessian sketch, Cpe{S) can be con¬ 
trolled up to constant. In addition to providing a lower 
bound on the PE using a sketching matrix just once, Pi¬ 
lanci and Wainwright also develop a new iterative sketc- 
thing scheme where sketching matrices are used repeatedly 
can reduce the PE significantly. 

4. Discussion and conclusion 

In this paper, we developed a framework for analyzing al¬ 
gorithmic and statistical criteria for general sketching ma¬ 
trices S € applied to the LS objective. Our frame¬ 

work reveals that the algorithmic and statistical criteria de¬ 
pend on different properties of the oblique projection ma¬ 
trix Hg = U{SUyU, where U is the left singular ma¬ 
trix for X. In particular, the algorithmic WC criteria de¬ 
pends on the quantity supj/Tg^Q since in that case 

the data may be arbitrary and worst-case, whereas the two 
statistical criteria (RE and PE) depends on HHg since in 
that case the data follow a linear model with homogenous 
noise variance. 

Using our framework we develop upper bounds for three 
performance criterion applied to 4 sketching schemes. Our 
upper bounds reveal that in the regime where p < r n, 
our sketching schemes achieve optimal performance up to 
constant in terms of WC and RE. On the other hand, the 
PE scales as meaning r needs to be close to n for good 
performance; and subsequent lower bounds in Pilanci and 
Wainwright (Pilanci & Wainwright, 2014) show that this 
upper bound can not be improved. 
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